In a previous post, I began an investigation into the Motor Vehicle Collisions - Crashes dataset. This personal project is meant to help me (a) practice my geospatial data skills, (b) replicate the spatial dependency model used by Morris et al. (2019), and (c) potentially extend that model to include a time component, which would theoretically allow me to model whether the congestion pricing zone has impacted vehicle collision rates in 2025. In the interest of breaking this project down into digestible and achievable chunks, I have decided that this, part 2 of 3, will focus on purely spatial modeling. Fortunately, Mitzi Morris published this guide to the topic, which I’ll be following.
I outline most of the requisite data sources for the project in the previous post, although for this post I’m adding some tract-level traffic/built environment data (fragmentation index and traffic volume), as well as some ACS covariates (percentage of commuters that use public transit, population, median income, median age). My process here is to summarize the point-level crash data to the census tract level and then join it with the covariates.
One note: the fragmentation index and traffic variables provided by Morris only cover 2095 tracts (out of 2315 listed). There were also some ACS variables with NA values in a couple rows. Because I’m not trying to save the world, just figure out how to model things, I simply removed these from the analysis. In a different world (or perhaps the future), I would either find more complete data, determine which tracts I’m actually okay with filtering out (e.g., parks?), or interpolate the data in some way.
Below, please find a data table with the final dataset that I’m using.
# Loading NYC Census Tracts: https://data.cityofnewyork.us/City-Government/2020-Census-Tracts/63ge-mke6/about_data
nyc_tracts <- read.csv("data/2020_Census_Tracts_20250606.csv", stringsAsFactors = FALSE) %>%
rename("geometry" = "the_geom")
nyc_tracts <- st_as_sf(nyc_tracts, wkt = "geometry", crs = 4326)
nyc_tracts <- nyc_tracts %>%
mutate(
area_sqm = Shape_Area / 27878400
) %>%
select(
geometry,
BoroCT2020,
BoroCode,
BoroName,
area_sqm,
GEOID
)
# Loading collisions data: https://data.cityofnewyork.us/Public-Safety/Motor-Vehicle-Collisions-Crashes/h9gi-nx95/about_data
filter <- c(0, NA)
crashes <- read.csv("data/Motor_Vehicle_Collisions_-_Crashes_20250605.csv") %>%
select(CRASH.DATE,
LATITUDE,
LONGITUDE,
NUMBER.OF.PERSONS.INJURED,
NUMBER.OF.PERSONS.KILLED,
NUMBER.OF.PEDESTRIANS.INJURED,
NUMBER.OF.PEDESTRIANS.KILLED) %>%
filter(! LATITUDE %in% filter,
! LONGITUDE %in% filter) %>%
rename(
"date" = "CRASH.DATE",
"lat" = "LATITUDE",
"long" = "LONGITUDE",
"persons_inj" = "NUMBER.OF.PERSONS.INJURED",
"persons_death" = "NUMBER.OF.PERSONS.KILLED",
"ped_inj" = "NUMBER.OF.PEDESTRIANS.INJURED",
"ped_death" = "NUMBER.OF.PEDESTRIANS.KILLED"
) %>%
mutate(
date = mdy(date)
) %>%
st_as_sf(coords = c("long","lat"), crs = 4326)
# Grabbing some census variables via Tidycensus, using Keyring for API Key privacy
tidycensus_api_key <- key_get(service = "tidycensus_API", username = "my_tidycensus")
census_api_key(tidycensus_api_key)
census_vars <- get_acs(state = "NY",
county = c("Bronx", "Kings", "New York", "Queens", "Richmond"),
geography = "tract",
variables = c(medincome = "B19013_001",
population = "B01003_001",
median_age = "B01002_001",
transport_baseline = "B08301_001",
transport_pubtransit = "B08301_010"),
geometry = FALSE,
keep_geo_vars = FALSE,
year = 2022,
output = "wide"
) %>%
mutate(
GEOID = as.numeric(GEOID),
median_income = medincomeE,
population = populationE,
median_age = median_ageE,
prop_pubtransit = transport_pubtransitE / transport_baselineE
) %>%
select(
GEOID,
median_income,
population,
median_age,
prop_pubtransit
)
# Associating CP Zone, Pre/Post, Treatment, Borough, and Census Tract w/ Observations
crashes <- crashes %>%
st_join(nyc_tracts, join = st_within) %>%
filter(! is.na(area_sqm))
# Aggregating/summarizing data to census tract level, no time component
crashes_tract <- crashes %>%
group_by(BoroCT2020) %>%
summarize(tot_crashes = n(),
area = mean(area_sqm)) %>%
ungroup() %>%
st_drop_geometry()
# Getting Fragmentation Index from: https://github.com/mitzimorris/geomed_2024/blob/main/data/nyc_study.geojson
frag_data = st_read(file.path("data", "nyc_study.geojson"), quiet = TRUE) %>%
st_drop_geometry() %>%
select(BoroCT2010, frag_index, traffic) %>%
mutate(
BoroCT2010 = as.numeric(BoroCT2010)
)
# Joining everything together, selecting only variables that I want
crashes_tracts_geo <- nyc_tracts %>%
right_join(crashes_tract,
by = "BoroCT2020") %>%
left_join(census_vars,
by = "GEOID") %>%
left_join(frag_data,
by = c("BoroCT2020" = "BoroCT2010")) %>%
select(! c(area)) %>%
filter(! if_any(everything(), is.na))
# Interactive Data Table for Display
crashes_dt <- crashes_tracts_geo %>%
st_drop_geometry() %>%
mutate(
area_sqm = round(area_sqm, 3),
prop_pubtransit = round(prop_pubtransit, 3)
)
datatable(crashes_dt,
extensions = 'Buttons',
filter = "top",
rownames = FALSE,
options = list(
autoWidth = TRUE,
scrollX = TRUE
),
class = 'compact',
escape = FALSE
) %>%
formatStyle(
columns = names(crashes_dt),
`white-space` = "nowrap",
`height` = "1.5em",
`line-height` = "1.5em"
)
Though I made some plots in the previous post, I’ll make some more here to hopefully illustrate (a) characteristics of the outcome variable and (b) visually explore how the predictor variables might be related to the outcome.
ggplot(data = crashes_tracts_geo, aes(x = tot_crashes)) +
geom_histogram(aes(y = after_stat(density)),
bins = 500,
color = "lightblue",
fill = "lightblue",
alpha = 0.75) +
geom_density(color = "#FFC600", linewidth = 1) +
theme_bw() +
labs(title = "Distribution Total Crashes per Census Tract (2024-25)",
x = "Number of Crashes",
y = "Density")
This distribution has a mean of 50, standard deviation of 35, and variance of 1225, which is not great news for the Poisson distribution (which assumes that mean = variance). We’ll see how that does
Not all the predictors have a super clear relationship with the outcome variable, but it’s a good idea to include them anyway.
It’s worth noting that, in my previous post, I looked at crashes per square mile, whereas this analysis simply uses the raw counts. I believe this is (a) because the Poisson model is for discrete count data and (b) the dependency models adjust for population. I’m not 100% clear on why they’re not using the per square mile measure, though.
crash_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "tot_crashes",
fill.scale = tm_scale_intervals(values = "brewer.yl_or_rd",
style = "jenks"),
fill.legend = tm_legend(title = "Total Crashes, 2024-25"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
crash_map
income_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "median_income",
fill.scale = tm_scale_intervals(values = "brewer.greens",
style = "jenks"),
fill.legend = tm_legend(title = "Median Income"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
medianage_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "median_age",
fill.scale = tm_scale_intervals(values = "brewer.blues",
style = "jenks"),
fill.legend = tm_legend(title = "Median Age"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(income_map,
medianage_map,
nrow = 1, ncol = 2)
prop_pubtransit_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "prop_pubtransit",
fill.scale = tm_scale_intervals(values = "brewer.purples",
style = "jenks",
value.na = "grey"),
fill.legend = tm_legend(title = "Proportion that Commute Using Public Transit"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
pop_per_sqm_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "population",
fill.scale = tm_scale_intervals(values = "brewer.yl_gn",
style = "jenks"),
fill.legend = tm_legend(title = "Population"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(prop_pubtransit_map,
pop_per_sqm_map,
nrow = 1, ncol = 2)
frag_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "frag_index",
fill.scale = tm_scale_intervals(values = "brewer.pu_or",
style = "quantile",
midpoint = 0,
value.na = "grey"),
fill.legend = tm_legend(title = "Fragmentation Index"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
traffic_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(fill = "traffic",
fill.scale = tm_scale_intervals(values = "brewer.reds",
style = "quantile",
midpoint = 0,
value.na = "grey"),
fill.legend = tm_legend(title = "Traffic Volume"),
lwd = 0.3) +
tm_layout(legend.outside = TRUE,
legend.outside.position = "right")
tmap_arrange(frag_map,
traffic_map,
nrow = 1, ncol = 2)
Ok–so we’ve visualized ad nauseum. Now, we’ve got to set up a spatial weights matrix. Essentially, something that tells us which of the tracts neighbor each other. In a different setting, we may want to vary weights (greater weight to tracts closer to the central tract, decreasing weights as we move away), but here we just set the weight equal to 1 if the tracts are neighbors and 0 if they are not.
The graph below shows these binary links between tracts.
nyc_nbs = poly2nb(crashes_tracts_geo)
nyc_coords = st_coordinates(st_centroid(crashes_tracts_geo['geometry']))
# Create a spatial points sf object for centroids
centroids_sf <- st_as_sf(data.frame(x = nyc_coords[,1], y = nyc_coords[,2]),
coords = c("x", "y"),
crs = st_crs(crashes_tracts_geo))
# Plot with tmap
spdep_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(col = "white") +
tm_lines(col = "white", lwd = 2) +
# Add points for centroids
tm_shape(centroids_sf) +
tm_dots(size = 0.25, col = "black") +
# Add lines for neighbors
tm_shape(st_as_sf(nb2lines(nyc_nbs, coords = nyc_coords))) +
tm_lines(col = "deepskyblue3")
spdep_map
In the outputs above and below, we can see that there are 2 regions with 0 links (islands), and 4 disjoint subgraphs, meaning networks that there are no paths between (Staten Island is a disjoint subgraph). This was done using the Queen’s case method (think of the squares that a Queen can reach on a chessboard–not only directly bordering neighbors, but diagonals as well.
Exactly what was needed in this department was a touch confusing to me. Essentially, I got the idea that (a) I should examine whether existing links in the graph made sense (e.g., if there’s a link across water, does it make sense to keep it in from a pedestrian-focused context?) and then (b) make sure the final graph is fully connected, because that’s a requirement of the ICAR model I want to use. In Morris’s example it seems that part (a) was mostly an exercise in spatial data manipulation, whereas the actual dataset used for the final analysis is the original graph with all of its cross-water connections and a few added links to make it a single component. The code below achieves this, and checks my work.
# Function to add a symmetric edge
add_symmetric_edge_nb <- function(nb, i, j) {
nb_res = nb
if (!(j %in% nb[[i]])) {
if (nb[[i]][1] == 0) {
nb_res[[i]] = j
} else {
nb_res[[i]] <- sort(c(nb[[i]], j))
}
}
if (!(i %in% nb[[j]])) {
if (nb[[j]][1] == 0) {
nb_res[[j]] = i
} else {
nb_res[[j]] <- sort(c(nb[[j]], i))
}
}
return(nb_res)
}
nyc_nbs_connected = poly2nb(crashes_tracts_geo)
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 3016400),
which(crashes_tracts_geo$BoroCT2020 == 5001800))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4088400),
which(crashes_tracts_geo$BoroCT2020 == 4107201))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4107201),
which(crashes_tracts_geo$BoroCT2020 == 4094201))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 4099801),
which(crashes_tracts_geo$BoroCT2020 == 4103202))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 2045600),
which(crashes_tracts_geo$BoroCT2020 == 2042600))
nyc_nbs_connected = add_symmetric_edge_nb(nyc_nbs_connected,
which(crashes_tracts_geo$BoroCT2020 == 1029900),
which(crashes_tracts_geo$BoroCT2020 == 2026900))
# Checking Work
spdep_map <- tm_shape(crashes_tracts_geo) +
tm_polygons(col = "white") +
tm_lines(col = "white", lwd = 2) +
# Add points for centroids
tm_shape(centroids_sf) +
tm_dots(size = 0.25, col = "black") +
# Add lines for neighbors
tm_shape(st_as_sf(nb2lines(nyc_nbs_connected, coords = nyc_coords))) +
tm_lines(col = "deepskyblue3")
spdep_map
cat('is symmetric? ', is.symmetric.nb(nyc_nbs_connected), '\n')
## is symmetric? TRUE
summary(nyc_nbs_connected)
## Neighbour list object:
## Number of regions: 1956
## Number of nonzero links: 10800
## Percentage nonzero weights: 0.2822839
## Average number of links: 5.521472
## 7 disjoint connected subgraphs
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 19 52 144 287 461 469 313 150 39 12 6 2 2
## 19 least connected regions:
## 55 257 259 521 581 582 644 835 839 893 1210 1249 1402 1490 1576 1591 1605 1635 1893 with 1 link
## 2 most connected regions:
## 1718 1785 with 13 links
cat('nodes per component')
## nodes per component
table(n.comp.nb(nyc_nbs_connected)$comp.id)
##
## 1
## 1956
We can see here that the graph is symmetric, and that there is only a single component listed, with 1956 nodes. One confusing thing is that there are still 7 disjoint subgraphs in the output. I genuinely don’t know why this is and I cannot figure it out, so if you’re reading this and you do know, please email me.
Finally, the fun part. Now, I know that model building should be an extremely stepwise and iterative process–skipping steps is likely to result in issues. That said, I want to balance this with post length, so I’m roughly planning to include one model + output for each type of model I’m building.
poisson_1_model_file = file.path('stan', 'poisson_1.stan')
cat(readLines(poisson_1_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
## }
## transformed data {
## vector[N] log_E = log(E);
##
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas);
## beta0 ~ std_normal();
## betas ~ std_normal();
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## vector[N] log_lik;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas; // centered data
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## for (n in 1:N) {
## log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
## }
## }
## }
So, this is not my model, it’s the simple Poisson model that Morris uses. So, rather than understanding it because I wrote it, I’m going to understand it because I explain it to you in this section. First, the data:
The transformed data block takes the log of population (E), which is a generally good principle for data that’s constrained to be all positive. It also means that the coefficient goes to the multiplicative scale, which is nice for interpretation.
In the parameters block, Morris defines an intercept, beta0, and a vector of coefficients, betas, for the covariates. The beta0 coefficient is split out so it can be applied to the log_E data.
The model specifies that y comes from a Poisson distribution, and gives beta0 and betas simple standard normal distributions (mean 0, sd 1).
Lastly, the generated quantities block. In Morris’s words, this is
for use in model checking and model comparison. Y_rep is used to hold
replicated data. Morris uses some optimization in this block that I
don’t entirely (or even really) understand, but this will be used to
conduct posterior predictive checks. That is, we can simulate fake data
and check their characteristics (mean, sd, quantiles) to see if the
model does a good job. Morris uses leave-one-out cross validation, and
uses the output from the loo() function–Expected Log
Predictive Density (ELPD)–which gives the log-probability of new data
given the model. A higher ELPD is better.
In the guide materials, Morris also makes some improvements to the model. Most notably, mean-centering the predictors. I ran the model without these adjustments and then added them. Output was, naturally, pretty similar. The improved model ran in only about 3 seconds per chain on my computer, as compared to 10-12 seconds for the original.
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
pois_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat
)
poisson_model_1 <- cmdstan_model(poisson_1_model_file)
set.seed(50)
poisson_fit_1 <- poisson_model_1$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 1 finished in 2.0 seconds.
## Chain 2 finished in 2.0 seconds.
## Chain 3 finished in 2.1 seconds.
## Chain 4 finished in 2.3 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 2.1 seconds.
## Total execution time: 2.5 seconds.
pois_summary <- poisson_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)
print(pois_summary_subset)
## # A tibble: 7 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.43 -4.43 0.003 0.003 -4.43 -4.42 1 2937. 2513.
## 2 betas[1] -0.196 -0.196 0.025 0.025 -0.237 -0.156 1.00 1649. 2157.
## 3 betas[2] -0.025 -0.025 0.007 0.007 -0.036 -0.013 1.00 1714. 1762.
## 4 betas[3] -0.006 -0.006 0 0 -0.007 -0.005 1 3763. 3166.
## 5 betas[4] 0.004 0.004 0.001 0.001 0.002 0.007 1.00 4238. 2839.
## 6 betas[5] 0.263 0.263 0.003 0.003 0.257 0.268 1 2838. 2529.
## 7 beta_intercept -6.45 -6.45 0.09 0.088 -6.60 -6.31 1.00 1640. 2038.
Ok, so here we have some output. First, the rhat values are all 1.00, which indicates the model mixed properly. At a basic level, the negative coefficients on the first three predictors (prop_pubtransit, median_income, median_age), indicate that higher levels of public transit use, median income, and median age are all associated with lower counts of crashes in a census tract. Conversely, higher fragmentation index (beta[5]) and traffic volume (beta[6]) are associated with higher counts of crashes. These all pass the sniff test.
This is where we’ll use the y_rep predicted values from the generated
quantities block of the Stan model. We’re hoping that the
characteristics (mean, sd, quantiles) of y_rep match the true data as
closely as possible. Morris included the helper functions
ppc_central_interval and ppc_y_yrep_overlay in
the “utils_dataviz.R” file.
y_rep_pois <- as_draws_matrix(poisson_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 360, pct: 18.40"
The output above shows that only 18% of the observed data points fall within the 50% predicted interval, which is not great. The graph below illustrates this. It does seem that the model doesn’t handle low or high values all that well.
ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')
Morris makes the point that the data are overdispersed (we saw that coming when looking at the distribution of the outcome variable earlier), meaning that the variance is higher than we expect (“we” being the users of the single-parameter Poisson model). The recommended solution is to add a simple random effects component to the model, which accounts for per-tract heterogeneity. This is similar to adding indicator variables for each tract, but assumes these varying intercepts come from a distribution.
poisson_2_model_file = file.path('stan', 'poisson_2.stan')
cat(readLines(poisson_2_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
## }
## transformed data {
## vector[N] log_E = log(E);
##
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## vector[N] theta; // heterogeneous random effects
## real<lower=0> sigma; // random effects variance
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas + theta * sigma);
## beta0 ~ std_normal();
## betas ~ std_normal();
## theta ~ std_normal();
## sigma ~ normal(0, 5);
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## vector[N] log_lik;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas + theta * sigma; // centered data
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## for (n in 1:N) {
## log_lik[n] = poisson_log_lpmf(y[n] | eta[n]);
## }
## }
## }
So, the difference in the new model is that it includes additional parameters theta and sigma, which define the distribution that the varying intercepts are drawn from.
# Unchanged from previous model
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
pois_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat
)
poisson_model_2 <- cmdstan_model(poisson_2_model_file)
set.seed(50)
poisson_fit_2 <- poisson_model_2$sample(data=pois_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 2 finished in 11.0 seconds.
## Chain 3 finished in 15.6 seconds.
## Chain 1 finished in 15.8 seconds.
## Chain 4 finished in 18.1 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 15.1 seconds.
## Total execution time: 18.2 seconds.
pois_summary <- poisson_fit_2$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]', 'sigma')
pois_summary_subset <- pois_summary[pois_summary$variable %in% vars, ]
numeric_cols <- sapply(pois_summary_subset, is.numeric)
pois_summary_subset[numeric_cols] <- round(pois_summary_subset[numeric_cols], 3)
print(pois_summary_subset)
## # A tibble: 8 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.47 -4.47 0.013 0.014 -4.49 -4.45 1.01 284. 632.
## 2 betas[1] -0.043 -0.04 0.113 0.112 -0.234 0.139 1.02 203. 424.
## 3 betas[2] 0.002 0.002 0.031 0.033 -0.051 0.052 1.01 254. 501.
## 4 betas[3] -0.003 -0.003 0.002 0.002 -0.006 0 1.01 272. 378.
## 5 betas[4] 0.016 0.016 0.006 0.006 0.008 0.026 1.01 243. 459.
## 6 betas[5] 0.265 0.265 0.015 0.016 0.239 0.29 1.00 208. 663.
## 7 sigma 0.593 0.592 0.01 0.01 0.576 0.61 1.01 398. 818.
## 8 beta_intercept -7.01 -7.03 0.4 0.422 -7.62 -6.31 1.01 230. 567.
Alright, the rhat values look fine. Some of the coefficients are slightly different as a result of adding the varying intercepts, but I’m gonna breeze past interpretation for now. Let’s check the model with posterior predictive checks.
y_rep_pois <- as_draws_matrix(poisson_fit_2$draws("y_rep"))
ppc_central_interval(y_rep_pois, pois_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1947, pct: 99.54"
Hmmmm…this is fishy, too. We want about 50% of the observations to fall within the 50% interval, not almost 100%. I guess this model is way over-fit, which probably makes it a good candidate for some leave-one-out cross-validation.
ppc_y_yrep_overlay(y_rep_pois, pois_data$y,
'Poisson model PPC\ny (blue dot) vs. y_rep (yellow 50% central interval, grey full extent)')
So, LOO-CV should allow us to compare the models on how they perform on unseen data.
loo_1_pois <- loo(poisson_fit_1$draws("log_lik"), save_psis = TRUE)
loo_2_pois <- loo(poisson_fit_2$draws("log_lik"), save_psis = TRUE)
loo_compare(loo_1_pois, loo_2_pois)
## elpd_diff se_diff
## model2 0.0 0.0
## model1 -16787.4 1034.9
The output above identifies the varying-intercepts model as a better fit to the data, which we kind of knew already. I am interested in those Pareto k diagnostic issues:
print("Model 1")
## [1] "Model 1"
pareto_k_table(loo_1_pois)
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 1955 99.9% 124
## (0.7, 1] (bad) 1 0.1% <NA>
## (1, Inf) (very bad) 0 0.0% <NA>
print("Model 2")
## [1] "Model 2"
pareto_k_table(loo_2_pois)
## Pareto k diagnostic values:
## Count Pct. Min. ESS
## (-Inf, 0.7] (good) 331 16.9% 147
## (0.7, 1] (bad) 1286 65.7% <NA>
## (1, Inf) (very bad) 339 17.3% <NA>
So, the second model does fit the data better, but the output in the second table tells me that upwards of 80% of the data points in my sample have unreliable LOO estimates. Morris mentions that this model handles overdispersion, but fails to distinguish between variance due to the spatial structure of the data and purely heterogeneous variance. If you think about it, this sort of makes sense. Model 2 doesn’t allow the tracts around a given tract to have any impact on the estimates for that tract, which is going to cause unreliability.
The ICAR model is a model of spatial dependency. Intuitively, it’s pretty simple: the value of a region often depends on its neighbors. In a census tract situation, this is a totally reasonable intuition–it’s pretty likely that census tracts next to each other are similar, whether that be in terms of who lives there, what the roads are like, etc. We’ll include an ICAR prior in our Stan model–accounting for the correlation between spatial units by linking the estimates spatially. I’m sure Morris or Andrew Gelman would be horrified by that imprecise explanation, but that’s why I’m not a PhD student.
We can return to that spatially-linked data from earlier, finally. Since I spent the time building a fully-connected NYC graph, I’m going to model the whole city.
icar_1_model_file = file.path('stan', 'icar_1.stan')
cat(readLines(icar_1_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
##
## // spatial structure
## int<lower = 0> N_edges; // number of neighbor pairs
## array[2, N_edges] int<lower = 1, upper = N> neighbors; // columnwise adjacent
## }
## transformed data {
## vector[N] log_E = log(E);
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## sum_to_zero_vector[N] phi; // spatial effects
## real<lower=0> sigma; // overall spatial variance
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas + phi * sigma);
## beta0 ~ std_normal();
## betas ~ std_normal();
## sigma ~ std_normal();
## target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]); // ICAR prior
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas + phi * sigma;
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## }
## }
nbs_adj = nbs_to_adjlist(nyc_nbs_connected)
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
icar_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat,
N_edges = ncol(nbs_adj),
neighbors = nbs_adj
)
icar_model_1 <- cmdstan_model(icar_1_model_file)
set.seed(50)
icar_fit_1 <- icar_model_1$sample(data=icar_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 2 finished in 14.3 seconds.
## Chain 1 finished in 17.6 seconds.
## Chain 4 finished in 21.5 seconds.
## Chain 3 finished in 23.9 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 19.3 seconds.
## Total execution time: 24.0 seconds.
icar_summary <- icar_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'betas[5]', 'sigma')
icar_summary_subset <- icar_summary[icar_summary$variable %in% vars, ]
numeric_cols <- sapply(icar_summary_subset, is.numeric)
icar_summary_subset[numeric_cols] <- round(icar_summary_subset[numeric_cols], 3)
print(icar_summary_subset)
## # A tibble: 8 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.47 -4.47 0.004 0.004 -4.48 -4.46 1 6136. 2895.
## 2 betas[1] -0.377 -0.375 0.129 0.131 -0.591 -0.165 1.01 225. 552.
## 3 betas[2] 0.028 0.028 0.043 0.045 -0.043 0.1 1.02 127. 280.
## 4 betas[3] 0.006 0.006 0.002 0.002 0.002 0.01 1.02 272. 576.
## 5 betas[4] 0.046 0.046 0.008 0.008 0.032 0.06 1.02 145. 224.
## 6 betas[5] 0.285 0.284 0.016 0.015 0.26 0.311 1.02 237. 466.
## 7 sigma 1.14 1.14 0.021 0.021 1.11 1.18 1.01 392. 853.
## 8 beta_intercept -7.69 -7.70 0.537 0.552 -8.57 -6.81 1.02 171. 370.
Well then, the rhat values are still pretty low! Again, not going to fuss with coefficient interpretation just yet. Let’s check the fit with the same posterior tools as before, as well as an ICAR correlation matrix.
phi_icar <- as_draws_matrix(icar_fit_1$draws("phi"))
plot_icar_corr_matrix(phi_icar, 'ICAR correlation matrix')
I won’t lie, the correlation matrix that Morris put together (admittedly with only Brooklyn data) looks a bit “better” than mine. I gather that we’re supposed to see greater correlation along the diagonal, since the components are listed in roughly geographic order. This does seem like a plot, though, that’s heavily dependent on the order of your data…so I’m not super confident that mine is set up entirely the same as Morris’s.
Before the checks, it’s worth noting that apparently LOO doesn’t work for ICAR, which I suppose makes sense, given that removing an observation would impact the spatial dependency network.
We’ll replicate the steps from the basic Poisson model to check our fit.
y_rep_icar <- as_draws_matrix(icar_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_icar, icar_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1931, pct: 98.72"
ppc_y_yrep_overlay(y_rep_icar, icar_data$y,
'Poisson + ICAR model PPC\ny (blue dot) vs. y_rep (orange 50% central interval, grey full extent)')
So, we can see that the predictive coverage is still comparable to the more basic Poisson models that I ran. 98% coverage is still pretty high, but I believe this is because there’s only one observation per tract, so there’s not a great understanding of within-tract variability.
Morris basically makes the point that the ICAR component effectively accounts for the spatial dependence in the data while providing similar predictive coverage to the random effects Poisson model. The next step is to combine the two into a BYM2 model.
The Besag York Mollié (BYM) model has both a spatial dependency parameter and a random effects parameter. It uses a mixing parameter to get a blend between the two, depending on how much of the variation seen in the outcome variable is spatial.
bym2_1_model_file = file.path('stan', 'bym2_1.stan')
cat(readLines(bym2_1_model_file), sep="\n")
## data {
## int<lower=0> N;
## array[N] int<lower=0> y; // count outcomes
## vector<lower=0>[N] E; // exposure
## int<lower=1> K; // num covariates
## matrix[N, K] xs; // design matrix
##
## // spatial structure
## int<lower = 0> N_edges; // number of neighbor pairs
## array[2, N_edges] int<lower = 1, upper = N> neighbors; // columnwise adjacent
##
## real tau; // scaling factor
## }
## transformed data {
## vector[N] log_E = log(E);
## // center continuous predictors
## vector[K] means_xs; // column means of xs before centering
## matrix[N, K] xs_centered; // centered version of xs
## for (k in 1:K) {
## means_xs[k] = mean(xs[, k]);
## xs_centered[, k] = xs[, k] - means_xs[k];
## }
## }
## parameters {
## real beta0; // intercept
## vector[K] betas; // covariates
## real<lower=0, upper=1> rho; // proportion of spatial variance
## sum_to_zero_vector[N] phi; // spatial effects
## vector[N] theta; // heterogeneous random effects
## real<lower = 0> sigma; // scale of combined effects
## }
## transformed parameters {
## vector[N] gamma = sqrt(1 - rho) * theta + sqrt(rho * inv(tau)) * phi; // BYM2
## }
## model {
## y ~ poisson_log(log_E + beta0 + xs_centered * betas + gamma * sigma);
## rho ~ beta(0.5, 0.5);
## target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]); // ICAR prior
## beta0 ~ std_normal();
## betas ~ std_normal();
## theta ~ std_normal();
## sigma ~ std_normal();
## }
## generated quantities {
## real beta_intercept = beta0 - dot_product(means_xs, betas); // adjust intercept
## array[N] int y_rep;
## {
## vector[N] eta = log_E + beta0 + xs_centered * betas + gamma * sigma;
## y_rep = max(eta) < 26 ? poisson_log_rng(eta) : rep_array(-1, N);
## }
## }
This model is still largely the same as the ones I’ve been running, with a few changes:
This is almost identical to what I’ve been doing, with the addition now of tau.
nbs_adj = nbs_to_adjlist(nyc_nbs_connected)
design_mat <- as.data.frame(crashes_tracts_geo) %>%
select(prop_pubtransit, median_income, median_age, frag_index, traffic) %>%
mutate(median_income = log(median_income),
traffic = log(traffic))
tau = get_scaling_factor(nyc_nbs_connected)
bym2_data <- list(
N = nrow(crashes_tracts_geo),
y = as.integer(crashes_tracts_geo$tot_crashes),
E = as.integer(crashes_tracts_geo$population),
K = ncol(design_mat),
xs = design_mat,
N_edges = ncol(nbs_adj),
neighbors = nbs_adj,
tau = tau
)
This is new to me, but Morris recommends simulating data directly from the prior. Essentially, it seems like we can look at the simulated data, which demonstrates the range of likely data based on the priors, and see if they seem well-conditioned. We’re going to simulate the parameters gamma, phi, theta, and rho using a stripped-down version of the model above, now excluding the covariates and outcome variable.
bym2_gamma_ppc_model_file = file.path('stan', 'bym2_gamma_ppc.stan')
cat(readLines(bym2_gamma_ppc_model_file), sep="\n")
## data {
## int<lower=0> N;
## // spatial structure
## int<lower = 0> N_edges; // number of neighbor pairs
## array[2, N_edges] int<lower = 1, upper = N> neighbors; // columnwise adjacent
##
## real tau; // scaling factor
## }
## parameters {
## real<lower=0, upper=1> rho; // proportion of spatial variance
## sum_to_zero_vector[N] phi; // spatial effects
## vector[N] theta; // heterogeneous random effects
## real<lower = 0> sigma; // scale of combined effects
## }
## transformed parameters {
## vector[N] gamma = sqrt(1 - rho) * theta + sqrt(rho * inv(tau)) * phi; // BYM2
## }
## model {
## rho ~ beta(0.5, 0.5);
## target += -0.5 * dot_self(phi[neighbors[1]] - phi[neighbors[2]]); // ICAR prior
## theta ~ std_normal();
## sigma ~ std_normal();
## }
With that model, we can run and take a look at the estimated values of phi, theta, and rho. In the output below, you can see that theta and rho are close to 1 and 0.5, which is what we want. I think the estimate for phi is a bit low…which I’m not entirely sure what to do with. I guess that we’re meant to be seeing whether the variance for phi and theta are comparable/on similar scales, which they largely are.
bym2_gamma_ppc_model <- cmdstan_model(bym2_gamma_ppc_model_file)
set.seed(50)
bym2_gamma_ppc_fit <- bym2_gamma_ppc_model$sample(data=bym2_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 1 finished in 17.7 seconds.
## Chain 3 finished in 18.0 seconds.
## Chain 2 finished in 19.1 seconds.
## Chain 4 finished in 19.9 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 18.7 seconds.
## Total execution time: 20.1 seconds.
phi <- as_draws_matrix(bym2_gamma_ppc_fit$draws("phi"))
phi_var_by_region <- apply(sqrt(1/tau) * phi, 2, var) # variance across draws for each region
print(paste("mean variance of phi scaled:", mean(phi_var_by_region)))
## [1] "mean variance of phi scaled: 0.890123773373277"
theta <- as_draws_matrix(bym2_gamma_ppc_fit$draws("theta"))
theta_var_by_region <- apply(theta, 2, var) # variance across draws for each region
print(paste("mean variance of theta:", mean(theta_var_by_region)))
## [1] "mean variance of theta: 1.0004212323147"
rho <- as_draws_array(bym2_gamma_ppc_fit$draws("rho"))
print(paste("mean estimate of rho:", mean(rho)))
## [1] "mean estimate of rho: 0.498749615924182"
bym2_model_1 <- cmdstan_model(bym2_1_model_file)
set.seed(50)
bym2_fit_1 <- bym2_model_1$sample(data=bym2_data, parallel_chains=cores, refresh=0)
## Running MCMC with 4 chains, at most 6 in parallel...
##
## Chain 4 finished in 63.1 seconds.
## Chain 3 finished in 90.5 seconds.
## Chain 1 finished in 99.2 seconds.
## Chain 2 finished in 119.6 seconds.
##
## All 4 chains finished successfully.
## Mean chain execution time: 93.1 seconds.
## Total execution time: 119.8 seconds.
bym2_summary <- bym2_fit_1$summary()
vars <- c('beta_intercept', 'beta0', 'betas[1]', 'betas[2]', 'betas[3]', 'betas[4]', 'sigma', 'rho')
bym2_summary_subset <- bym2_summary[bym2_summary$variable %in% vars, ]
numeric_cols <- sapply(bym2_summary_subset, is.numeric)
bym2_summary_subset[numeric_cols] <- round(bym2_summary_subset[numeric_cols], 3)
print(bym2_summary_subset)
## # A tibble: 8 × 10
## variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 beta0 -4.47 -4.47 0.011 0.011 -4.49 -4.45 1.00 5453. 2671.
## 2 betas[1] -0.272 -0.271 0.134 0.134 -0.491 -0.052 1.00 3424. 3062.
## 3 betas[2] 0.016 0.017 0.041 0.041 -0.051 0.081 1 3146. 3063.
## 4 betas[3] 0.003 0.003 0.002 0.002 0 0.007 1 3932. 3183.
## 5 betas[4] 0.044 0.044 0.008 0.008 0.03 0.058 1.00 2603. 2458.
## 6 rho 0.771 0.774 0.04 0.039 0.702 0.831 1.01 341. 769.
## 7 sigma 0.901 0.899 0.047 0.048 0.824 0.981 1.01 310. 850.
## 8 beta_intercept -7.64 -7.64 0.511 0.512 -8.46 -6.80 1.00 3251. 3405.
Alrighty–we’re working with nice rhat values again! And, interestingly, the coefficient of 0.77 for rho means that these data are highly spatially correlated (it’s the proportion of spatial variance in the combined spatial/non-spatial random effects). Next up, some posterior predictive checks and the correlation matrix.
phi_bym2 <- as_draws_matrix(bym2_fit_1$draws("phi"))
plot_icar_corr_matrix(phi_bym2, 'BYM2 spatial correlation matrix')
Ugh–kind of hard to read. There’s definitely more red along the diagonal, which is good. Morris’s graph showed the same pattern. I’m going to be honest, I’m starting to wonder whether my issue is actually that I have significantly more tracts than they do, so this might just be harder to read, absent any statistical difference.
Anyway, let’s do the posterior predictive checks.
We’ll replicate the steps from the other models to check our fit.
y_rep_bym2 <- as_draws_matrix(bym2_fit_1$draws("y_rep"))
ppc_central_interval(y_rep_bym2, bym2_data$y)
## [1] "y total: 1956, ct y is within y_rep central 50% interval: 1942, pct: 99.28"
ppc_y_yrep_overlay(y_rep_bym2, bym2_data$y,
'Poisson + BYM2 model PPC\ny (blue dot) vs. y_rep (orange 50% central interval, grey full extent)')
The posterior predictive coverage for this model is just as good as the others, if not better. So, where have I gotten in this post? It’s all become a bit abstract. My understanding of the purpose of the BYM2 model is that it’s effective at handling both spatial and non-spatial random effects. I think this will continue to be important when I move to phase III of the project: introducing monthly counts to the data structure, rather than having a single summed count across everything. This addition of more hierarchical structure to the data (monthly effects, more variation within tracts) will allow me to hopefully understand trends over time.
And on that bombshell, I think it’s time to wrap up before this gets even longer.